In unsupervised domain adaptation (UDA), a model trained on source data (e.g., synthetic) is adapted to target data (e.g., real-world) without access to target annotations. Most previous UDA methods struggle with classes that have a similar visual appearance in the target domain, as no ground truth is available to learn the slight appearance differences. To address this problem, we propose a Masked Image Consistency (MIC) module that enhances UDA by learning spatial context relations of the target domain as additional clues for robust visual recognition. MIC enforces consistency between the predictions for masked target images, in which random patches are withheld, and pseudo-labels that are generated from the complete image by an exponential moving average teacher. To minimize the consistency loss, the network has to learn to infer the predictions for the masked regions from their context. Due to its simple and universal concept, MIC can be integrated into various UDA methods across different visual recognition tasks such as image classification, semantic segmentation, and object detection. MIC significantly improves the state-of-the-art performance across these recognition tasks for synthetic-to-real, day-to-nighttime, and clear-to-adverse-weather UDA. For instance, MIC achieves an unprecedented UDA performance of 75.9 mIoU on GTA-to-Cityscapes and 92.8% on VisDA-2017, corresponding to improvements of +2.1 and +3.0 percentage points over the previous state of the art. The implementation is available at https://github.com/lhoyer/MIC.
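The two mechanisms the abstract names, patch masking and an EMA teacher, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the patch size, mask ratio, and EMA decay are hypothetical values, and the teacher/student are stood in for by flat parameter vectors.

```python
import numpy as np

def mask_patches(img, patch_size=4, mask_ratio=0.5, rng=None):
    """Withhold random square patches of an (H, W, C) image, as in masked
    image consistency: pixels of withheld patches are set to zero."""
    rng = np.random.default_rng(rng)
    h, w = img.shape[0] // patch_size, img.shape[1] // patch_size
    keep = rng.random((h, w)) >= mask_ratio          # True = patch is kept
    mask = np.kron(keep, np.ones((patch_size, patch_size)))
    return img * mask[..., None], keep

def ema_update(teacher, student, alpha=0.999):
    """Exponential-moving-average teacher update on flat parameter vectors."""
    return alpha * teacher + (1.0 - alpha) * student

img = np.ones((8, 8, 3))
masked, keep = mask_patches(img, patch_size=4, mask_ratio=0.5, rng=0)
```

In the full method, the consistency loss would compare the student's prediction on `masked` against the teacher's pseudo-label on the unmasked `img`.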
Real-time semantic segmentation plays an important role in intelligent-vehicle scenarios. Recently, numerous networks have incorporated information from multi-size receptive fields to facilitate feature extraction in real-time semantic segmentation tasks. However, these methods preferentially adopt massive receptive fields to elicit more contextual information, which may result in inefficient feature extraction. We believe that carefully elaborated receptive fields are crucial, considering the demand for efficient feature extraction in real-time tasks. Therefore, we propose an effective and efficient architecture termed Dilation-wise Residual segmentation (DWRSeg), which possesses different sets of receptive-field sizes at different stages. The architecture involves (i) a Dilation-wise Residual (DWR) module for extracting features based on different scales of receptive fields in the high-level stages of the network; (ii) a Simple Inverted Residual (SIR) module that uses an inverted bottleneck structure to extract features in the low-level stages; and (iii) a simple fully convolutional network (FCN)-like decoder for aggregating multiscale feature maps to generate the prediction. Extensive experiments on the Cityscapes and CamVid datasets demonstrate the effectiveness of our method, which achieves a state-of-the-art trade-off between accuracy and inference speed while also being lighter in weight. Without using pretraining or resorting to any training trick, we achieve 72.7% mIoU on the Cityscapes test set at a speed of 319.5 FPS on one NVIDIA GeForce GTX 1080 Ti card, which is significantly faster than existing methods. The code and trained models are publicly available.
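The idea of assigning different receptive-field sizes to different dilations can be made concrete with the standard dilated-convolution extent formula. The dilation set (1, 3, 5) below is a hypothetical example, not the paper's actual configuration.

```python
def dilated_kernel_extent(k, d):
    """Effective spatial extent of a k x k kernel with dilation d."""
    return k + (k - 1) * (d - 1)

def stacked_receptive_field(layers):
    """Receptive field of a stack of stride-1 dilated convolutions,
    each given as a (kernel_size, dilation) pair."""
    rf = 1
    for k, d in layers:
        rf += dilated_kernel_extent(k, d) - 1
    return rf

# A multi-branch block covering small to large context with 3x3 kernels:
branch_extents = {d: dilated_kernel_extent(3, d) for d in (1, 3, 5)}
```

This shows why a few well-chosen dilations can cover a range of receptive-field sizes far more cheaply than enlarging the kernel itself.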
Score-based modeling through stochastic differential equations (SDEs) has provided a new perspective on diffusion models, and demonstrated superior performance on continuous data. However, the gradient of the log-likelihood function, i.e., the score function, is not properly defined for discrete spaces. This makes it non-trivial to adapt score-based modeling to categorical data. In this paper, we extend diffusion models to discrete variables by introducing a stochastic jump process where the reverse process denoises via a continuous-time Markov chain. This formulation admits an analytical simulation during backward sampling. To learn the reverse process, we extend score matching to general categorical data and show that an unbiased estimator can be obtained via simple matching of the conditional marginal distributions. We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
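A simple concrete instance of a continuous-time Markov chain on categorical data is the uniform-jump process, whose transition kernel has a closed form. The paper's jump process is not necessarily this one; the uniform kernel is shown here only to illustrate what "analytical simulation" of a CTMC can look like.

```python
import numpy as np

def uniform_ctmc_kernel(K, beta, t):
    """Transition matrix of a CTMC over K categories that jumps to a
    uniformly random category with rate beta. The closed form
    P(t) = exp(-beta*t) * I + (1 - exp(-beta*t)) * (1/K) * ones
    avoids any numerical matrix exponential."""
    stay = np.exp(-beta * t)
    return stay * np.eye(K) + (1.0 - stay) * np.full((K, K), 1.0 / K)

P = uniform_ctmc_kernel(K=4, beta=1.0, t=2.0)
```

At t = 0 the kernel is the identity (no corruption), and for large t every row converges to the uniform distribution, mirroring how continuous diffusions converge to a Gaussian prior.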
Recent studies on contrastive learning have achieved remarkable performance solely by leveraging very few labels in the context of medical image segmentation. Existing methods mainly focus on instance discrimination and invariant mapping. However, they face three common pitfalls: (1) tailness: medical image data usually follows an implicit long-tail class distribution, and blindly leveraging all pixels in training leads to data-imbalance issues and degraded performance; (2) consistency: it remains unclear whether a segmentation model learns meaningful yet consistent anatomical features, due to the intra-class variations between different anatomical features; (3) diversity: the intra-slice correlations within the entire dataset have received significantly less attention. This motivates us to seek a principled approach for strategically making use of the dataset itself to discover similar yet distinct samples from different anatomical views. In this paper, we introduce a novel semi-supervised medical image segmentation framework termed Mine yOur owN Anatomy (MONA), and make three contributions. First, prior work argues that every pixel is equally important for model training; we empirically observe that this alone is unlikely to define meaningful anatomical features, mainly due to the lack of supervisory signals. We show two simple solutions towards learning invariances, using stronger data augmentations and nearest neighbors. Second, we construct a set of objectives that encourage the model to be capable of decomposing medical images into a collection of anatomical features in an unsupervised manner. Lastly, extensive results on three benchmark datasets under different labeled settings validate the effectiveness of our proposed MONA, which achieves new state-of-the-art performance under the different labeled settings.
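One of the two solutions mentioned, using nearest neighbors to mine "similar yet distinct" samples as positives, can be sketched with cosine similarity over feature vectors. This is a generic nearest-neighbor-positive routine under assumed L2-normalized features, not MONA's actual mining procedure.

```python
import numpy as np

def nearest_neighbor_positives(feats):
    """For each feature row, return the index of its nearest neighbor
    (excluding itself) by cosine similarity -- a simple way to pair each
    sample with a similar yet distinct positive."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)   # never pick the sample itself
    return sim.argmax(axis=1)

feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
pos = nearest_neighbor_positives(feats)
```

A contrastive loss would then pull each sample toward `feats[pos[i]]` instead of only toward an augmented view of itself.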
The Metropolis-Hastings (M-H) algorithm has been well studied in continuous spaces, but a similar understanding is lacking in discrete spaces. Recently, a locally balanced proposal (LBP) was proven to be asymptotically optimal, but the optimal scaling problem remained open. In this paper, we establish, for the first time, that the efficiency of M-H in discrete spaces can also be characterized by an asymptotic acceptance rate that is independent of the target distribution. Moreover, we verify, both theoretically and empirically, that the optimal acceptance rates for LBP and random walk Metropolis (RWM) are 0.574 and 0.234, respectively. These results also help establish that LBP is asymptotically O(N^{2/3}) times more efficient than RWM with respect to the model dimension N. Knowledge of the optimal acceptance rate allows one to automatically tune the neighborhood size of the proposal distribution in discrete spaces, directly analogous to step-size control in continuous spaces. We demonstrate empirically that such adaptive M-H sampling can robustly improve sampling on various target distributions in discrete spaces, including training deep energy-based models.
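The tuning rule implied by a known optimal acceptance rate can be sketched as a Robbins-Monro style update on the log neighborhood size. The step size and the specific update form are hypothetical; only the target rates 0.234 (RWM) and 0.574 (LBP) come from the abstract.

```python
import math

RWM_TARGET, LBP_TARGET = 0.234, 0.574

def adapt_neighborhood(log_size, observed_acc, target_acc, step=0.1):
    """Grow the proposal neighborhood (e.g., number of flipped coordinates)
    when the acceptance rate exceeds the target, shrink it when below."""
    return log_size + step * (observed_acc - target_acc)

# Accepting too often (50% > 23.4%) drives the neighborhood size upward:
s = 0.0
for acc in (0.5, 0.5, 0.5):
    s = adapt_neighborhood(s, acc, RWM_TARGET)
size = math.exp(s)
```

In a real sampler, `observed_acc` would be a running average of accept/reject outcomes, and `size` would be rounded to an integer number of proposed flips.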
The hardness of combinatorial optimization (CO) problems hinders collecting solutions for supervised learning. Moreover, learning neural networks for CO problems is notoriously difficult in the absence of labeled data, as training is easily trapped at local optima. In this work, we propose a simple but effective annealed training framework for CO problems. In particular, we transform CO problems into unbiased energy-based models (EBMs). We carefully choose the penalty terms so as to make the EBM as smooth as possible. We then train graph neural networks to approximate the EBM. To prevent training from getting stuck at local optima near the initialization, we introduce an annealed loss function. An experimental evaluation demonstrates that our annealed training framework obtains substantial improvements. On four types of CO problems, our method achieves performance substantially better than other unsupervised neural methods on both synthetic and real-world graphs.
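One common way to turn a CO problem into an energy-based model is the penalized encoding of maximum independent set shown below, paired with a temperature schedule that is annealed from smooth to sharp. Both the penalty value and the linear schedule are illustrative assumptions; the paper's exact penalty choice and annealing schedule are not reproduced here.

```python
import numpy as np

def mis_energy(x, edges, penalty):
    """Energy of a maximum-independent-set assignment x in {0,1}^n:
    reward selected vertices, penalize selected endpoints sharing an edge."""
    x = np.asarray(x, dtype=float)
    conflict = sum(x[i] * x[j] for i, j in edges)
    return -x.sum() + penalty * conflict

def annealed_temperature(step, total_steps, t_start=2.0, t_end=0.01):
    """Linear annealing: a smooth (high-temperature) energy landscape early
    in training, sharpening toward the true objective later."""
    frac = step / total_steps
    return t_start + frac * (t_end - t_start)

# Triangle graph: selecting all three vertices violates every edge.
edges = [(0, 1), (1, 2), (0, 2)]
e_all = mis_energy([1, 1, 1], edges, penalty=2.0)   # -3 + 2*3 = 3.0
e_one = mis_energy([1, 0, 0], edges, penalty=2.0)   # -1.0
```

Training would sample from exp(-E/T) with T following the schedule, so early gradients are not dominated by the hard constraint terms.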
Relationships in scientific data, such as the numerical and spatial distribution relationships of features in univariate data, the relationships among scalar-value combinations in multivariate data, and the associations among volumes in time-varying and ensemble data, are intricate and complex. This paper presents voxel2vec, a novel unsupervised representation learning model for learning distributed representations of scalar values and scalar-value combinations in a low-dimensional vector space. Its basic assumption is that if two scalar values or scalar-value combinations have similar contexts, they usually have high similarity in terms of features. By representing scalar values and scalar-value combinations as symbols, voxel2vec learns their similarity in the context of spatial distribution, and then allows us to explore the overall associations between volumes via transfer prediction. We demonstrate the usefulness and effectiveness of voxel2vec by comparing it with the isosurface similarity map of univariate data, and by applying the learned distributed representations to feature classification for multivariate data and to association analysis for time-varying and ensemble data.
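The "similar contexts imply similar features" assumption is the same one behind word2vec-style skip-gram training: symbols that co-occur within a spatial window form training pairs. The 1-D field and window size below are toy assumptions; the actual model operates on 3-D voxel neighborhoods.

```python
def context_pairs(field, window=1):
    """Collect (center symbol, context symbol) pairs from a 1-D scalar
    field, skip-gram style: symbols within `window` positions of each
    other become a training pair."""
    pairs = []
    n = len(field)
    for i, center in enumerate(field):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i:
                pairs.append((center, field[j]))
    return pairs

field = [0, 0, 1, 2]
pairs = context_pairs(field, window=1)
```

An embedding table trained on such pairs would place scalar value 0 near 1 (they co-occur) and far from symbols it never neighbors.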
Recently, a family of locally balanced (LB) samplers has demonstrated excellent performance at sampling and learning energy-based models (EBMs) in discrete spaces. However, the theoretical understanding of this success is limited. In this work, we show how LB functions give rise to LB dynamics corresponding to Wasserstein gradient flow in a discrete space. From first principles, previous LB samplers can then be seen as discretizations of the LB dynamics with respect to the Hamming distance. Based on this observation, we propose a new algorithm, Locally Balanced Jump (LBJ), obtained by discretizing the LB dynamics with respect to simulation time. As a result, LBJ has a location-dependent "velocity" that allows it to propose moves over larger distances. Furthermore, LBJ decouples each dimension into independent sub-processes, enabling convenient parallel implementation. We demonstrate the advantages of LBJ for sampling and learning in various binary and categorical distributions.
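The LB sampler family that LBJ builds on weights each single-coordinate move by a balancing function g applied to the probability ratio of the move, e.g. g(t) = sqrt(t), which satisfies the locally balanced condition g(t) = t·g(1/t). The sketch below shows the proposal distribution over bit flips for a toy energy; it illustrates the LB proposal, not LBJ itself.

```python
import numpy as np

def lb_flip_proposal(x, energy, g=np.sqrt):
    """Locally balanced proposal over single-bit flips of a binary state:
    each flip x -> y is weighted by g(pi(y)/pi(x)) = g(exp(E(x) - E(y)))."""
    x = np.asarray(x)
    weights = []
    for i in range(len(x)):
        y = x.copy()
        y[i] = 1 - y[i]
        weights.append(g(np.exp(energy(x) - energy(y))))
    w = np.array(weights)
    return w / w.sum()

# Toy energy that favors all-ones states:
energy = lambda s: -float(np.sum(s))
p = lb_flip_proposal(np.array([0, 0, 1]), energy)
```

Flips that raise the target probability (turning a 0 into a 1) receive more proposal mass than the flip that lowers it, which is exactly the "informed" behavior LB samplers exploit.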
When using LiDAR semantic segmentation models for safety-critical applications such as autonomous driving, it is essential to understand and improve their robustness with respect to a large range of LiDAR corruptions. In this paper, we aim to comprehensively analyze the robustness of LiDAR semantic segmentation models under various corruptions. To rigorously evaluate the robustness and generalizability of current approaches, we propose a new benchmark called SemanticKITTI-C, which features 16 out-of-domain LiDAR corruptions in three groups, namely adverse weather, measurement noise, and cross-device discrepancy. We then systematically investigate 11 LiDAR semantic segmentation models, spanning different input representations (e.g., point clouds, voxels, and projected images), network architectures, and training schemes. Through this study, we obtain two insights: 1) the input representation plays a crucial role in robustness; specifically, different representations degrade differently under specific corruptions; 2) although state-of-the-art LiDAR semantic segmentation methods achieve promising results on clean data, they are less robust when dealing with noisy data. Finally, based on the above observations, we design a robust LiDAR segmentation model (RLSeg) that greatly boosts robustness with simple but effective modifications. We hope that our benchmark, comprehensive analysis, and observations can boost future research in robust LiDAR semantic segmentation for safety-critical applications.
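Two of the corruption groups, measurement noise and point loss, can be sketched as simple transforms of an (N, 3) point cloud. SemanticKITTI-C defines 16 specific corruptions; the dropout ratio and jitter level below are illustrative stand-ins, not the benchmark's parameters.

```python
import numpy as np

def corrupt_point_cloud(points, drop_ratio=0.1, noise_std=0.02, rng=0):
    """Two toy out-of-domain corruptions for an (N, 3) LiDAR point cloud:
    random point dropout (occlusion / sensor failure) and per-point
    Gaussian jitter (measurement noise)."""
    rng = np.random.default_rng(rng)
    keep = rng.random(len(points)) >= drop_ratio
    kept = points[keep]
    return kept + rng.normal(0.0, noise_std, kept.shape)

pts = np.zeros((1000, 3))
out = corrupt_point_cloud(pts, drop_ratio=0.1, noise_std=0.02)
```

A robustness benchmark then reports the segmentation accuracy of each model on such corrupted clouds relative to its clean-data accuracy.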
In recent years, arbitrary image style transfer has attracted more and more attention. Given a pair of content and style images, the goal is to synthesize a stylized image that retains the content of the former while capturing the style patterns of the latter. However, it is difficult to maintain a good trade-off between content details and style features: when an image is stylized with sufficient style patterns, the content details may be damaged, and sometimes the objects in the image can no longer be distinguished clearly. For this reason, we present STT, a new transformer-based method for image style transfer, together with an edge loss that noticeably enhances content details and avoids the blurred results caused by excessive rendering of style features. Qualitative and quantitative experiments demonstrate that STT achieves performance comparable to state-of-the-art image style transfer methods while alleviating the content-leak problem.
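An edge loss of the kind described can be sketched as a mean squared difference between edge maps of the content and stylized images. The abstract does not specify STT's edge operator; the Sobel-based variant below is an assumed, plausible instantiation on grayscale arrays.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d_valid(img, kernel):
    """Plain 'valid'-mode 2-D correlation (no padding)."""
    kh, kw = kernel.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def edge_loss(content, stylized):
    """Mean squared difference between Sobel edge maps of the content
    and stylized images: penalizes stylization that blurs content edges."""
    diff = 0.0
    for k in (SOBEL_X, SOBEL_Y):
        diff += np.mean((conv2d_valid(content, k) - conv2d_valid(stylized, k)) ** 2)
    return diff

img = np.arange(25, dtype=float).reshape(5, 5)
```

The loss is zero when edges are perfectly preserved and grows as the stylized output loses the content image's structure.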